-
Notifications
You must be signed in to change notification settings - Fork 286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid kernel failures with multiple processes #437
Conversation
@minrk @mpacer @Carreau We traced down this issue during PyCon sprints as one of the reasons for parallel nbconvert / papermill calls failing. Would love your eyes on the PR and to get some momentum on resolving this, plus the PRs in nbconvert associated so we can enable parallel executions end-to-end. Alexander actually has more experience with zeroMQ than myself so it was great he ended up digging into this problem. My main concern is not knowing the trade-offs with original codification of zeromq here and if the changes would have subtle issues that aren't obvious. @alexrudy Do you think we'll need to add an accessor to the active session counter inside the cython cdef or does isolating to individual clients resolve that entirely for the multiprocessing and threaded case?
I disagree there. It's a valid use-case and there are other constraints on problems that would require using multiprocessing with a jupyter client. The test failure is with python 3.4 which most of the jupyter ecosystem tools (or their upstreams) have dropped support for. I was seeing the same failure pattern without any code changes in nbconvert 5.4 when developing. I'd suggest we remove 3.4 support for the next release and not block the PR on kernel timeouts with 3.4 (which is what we did in nbconvert). Side note, I don't have permissions on this repo to even assign reviewers. Would love to have broader permissions in jupyter_client and help out here when possible. |
@MSeal I gave you permissions on this repo and pushed a commit dropping Python 3.4 support. Feel free to make this your first merge if all looks well here. |
@MSeal –
After thinking about this for a while, I didn't go the "reference counting" route. The problem with that approach is that it is possible to start kernel A (probably in a thread), then fork, then start kernel B, then interact with kernel A & kernel B – in this case, the "reference count" for the ZMQ context won't have dropped to zero when the fork happens, and the ZMQ context in the forked process will still be in an invalid state. I tried to demonstrate that in the test Its not really that the context doesn't get "cleaned up", but that no open context can be passed across a process fork. IMO, the "right" solution here really is to use a separate context for each client, which does prevent ZMQ's speedy in-process communication between different clients, but otherwise makes them safe share across processes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM now. Thanks for the detailed explanation and thought @alexrudy
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulation you did some good work ! Hopefully your backport PR will be tested by the continuous integration and merged soon! If these instruction are inaccurate, feel free to suggest an improvement. |
Backport PR #437: Avoid kernel failures with multiple processes
This PR adds several new tests which ensure that kernels can be used when started and operated in a multiprocessing context.
Before this change, if the global ZMQ context was initialized before the processed forked, then the child processes might fail. This is easy to do accidentally – run some kernel first in the current process, to check if it works, then farm that same function out to a pool of processes, and things will go poorly. I think this is because ZMQ contexts are not safe after fork. However, switching from using the global ZMQ context to a single ZMQ context per client eliminates the failures.
One could argue that kernels should never be used in multiprocessing contexts, because they are run in a subprocess anyhow, so multiprocessing doesn't provide a better way to escape the gil. However, I can't think of a way to detect that case and raise an error without waiting for a timeout, and the error returned there has to be pretty generic (it is currently
RuntimeError: Kernel didn't respond in 30 seconds
).Sorry to drop this PR in without opening an issue, but the issue itself is hard enough to reproduce that a minimal reproduction was easiest in the context of the test suite. Most of the code here is adding various combinations of parallel execution to the test suite and ensuring that they work.
We discovered this problem when @MSeal and I tried to get parallel uses of papermill running during pycon sprints. The original papermill issue is nteract/papermill#329. The ability to run kernels in parallel without surprising results is also helpful for nbconvert (see jupyter/nbconvert#1018)